Data scientists should have an intentional process for tracking changes in their code over time.
- This process is known as version control.
In ancient times, like the early 2000s, we simply renamed our files manually:
mycode.R
mycode_v2.R
mycode_final.R
mycode_reallyfinalv2.R
In modern times, version control is automatic in many contexts.
- Microsoft Word stores versions of a file each time it is saved. We can restore an old version if we make a mistake.
- Dropbox automatically saves versions of files each time they are changed. Users can restore old versions of files within a certain time period (30 days to 1 year depending on a user’s subscription level).
The industry standard for version control of computer code and software is git. We will learn how to use git from within RStudio while using GitHub to host our code and version control data.
What is git?
Software developers need a fast, reliable way of tracking changes to their source code when working as individuals or as part of a team.
The best-in-class approach to software version control is git, which was created by Linus Torvalds in 2005 to improve on the version control applications available at the time.
From https://git-scm.com/:
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
By default, git tracks files changed in a directory or repository locally, i.e., on our actual computer, which can create some potential challenges:
- git does not back up our code or changes, which would be problematic if our computer becomes compromised.
- A centralized repository must be available when multiple contributors are working on the same project. There needs to be a way of easily managing each person’s contributions and keeping track of the development and stable branches of the code.
GitHub is the most popular solution for these two issues.
What is GitHub?
GitHub (https://www.github.com) is an online platform for hosting git repositories.
GitHub makes it easy to:
- Control who is able to access and change a repository.
- Track bugs in our code.
- Keep a record of feature requests.
- Delegate responsibilities.
- Maintain a wiki.
- etc.
GitHub is a great place to:
- Back up and maintain version control of our software.
- Make our software public and allow others to contribute and interact with our work.
- Demonstrate our coding skills and abilities by creating publicly available project repositories.
- Host a professional website for potential employers or clients.
Linking RStudio and GitHub
We can use RStudio to manage a local git repository and push the changes to GitHub.com.
We outline the setup process below.
We assume that both R (https://cran.r-project.org/) and RStudio (https://posit.co/downloads) are installed.
Create a GitHub account
First, we create a GitHub account.
- Navigate to https://github.com
- Click Sign Up.
- Choose an email address to associate with your GitHub account.
- Educational users with a .edu email address can upgrade for free to a GitHub Pro account.
- Choose a strong password.
- Select a username.
- Think long term about this username. It may be an account that follows you for much of your professional life!
- Good choices are related to your name.
- E.g., if my name is John Q Smith, some good options might be jsmith, jqsmith, john-smith, john-q-smith, etc.
- Bad choices are not professional or unique.
- E.g., jsmith198451, i-love-cereal, PartyKing.
- Continue the registration process.
Install git
Next, we install a version of git on our computer.
- It might be wise to verify whether git is already installed on your computer.
- To check, follow the “Verifying communication between git and RStudio” section below before continuing with the installation instructions.
Download and install the appropriate version of git from https://git-scm.com/downloads.
In general, it is best to install git using all the default settings.
Verifying communication between git and RStudio
We need to make sure that RStudio is aware of the location of the git program on our computer.
- Navigate to Global Options (under Tools for Windows machines and currently under Edit → Preferences for a Mac).
- Click Git/SVN.
- Note whether RStudio has automatically discovered the file path of the git executable in the “Git executable” field.
If the Git executable field is empty or wrong, then you will need to figure out the issue. You can try:
- Reinstalling git.
- Using a different approach for installing git.
- Manually locating the git installation on your computer and entering that file path in the field.
- Searching the web for solutions that have worked for other people.
To verify that RStudio has the correct path to git:
- Open a new Terminal window by pressing Alt + Shift + R on a Windows computer or Option + Shift + R on a Mac.
- Type git --version in the Terminal window and press Enter.
- If RStudio is looking in the correct location for git, then the installed version of git will be printed in the Terminal window.
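The check above can be extended slightly to also report where git is installed, which is the path to enter in the “Git executable” field if RStudio did not find it automatically. This is a minimal sketch, assuming a POSIX-style shell (for example, Git Bash on Windows or the default Terminal on a Mac); the function name check_git is a hypothetical helper, not part of git or RStudio.

```shell
# Report where git lives on the PATH and which version it is.
# check_git is a hypothetical helper name used only for this sketch.
check_git() {
  if command -v git >/dev/null 2>&1; then
    echo "git found at: $(command -v git)"
    git --version
  else
    echo "git not found on PATH"
    return 1
  fi
}

check_git || echo "Install git, then reopen RStudio and try again."
```

In the Windows Command Prompt, which is not a POSIX shell, running where git serves a similar purpose.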
Using version control for an RStudio Project
There are different ways to manage the version control workflow from within RStudio.
We present the most basic workflow, which is to enable version control for a new project cloned from an existing GitHub repository.
We will do this by creating a new, empty repository, but you can do the same thing with an existing repository as long as you have the URL for the repository you want to clone.
- Create a new repository on GitHub by clicking New on your GitHub homepage.
- Name your new repository, make it public, and click Create Repository. We will create a new repository named “mytest”.
- Copy the URL for the repository.
Next, create a New Project using version control.
- Click File → New Project.
- Click Version Control.
- Click Git.
- Paste the URL in the “Repository URL” field and click Create Project.
If everything is connected correctly, the newly cloned project should include a Git tab in the Environment pane.
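Creating a project from version control corresponds roughly to running git clone in a terminal. The following is a minimal sketch, assuming a POSIX-style shell. So that it runs without a network connection, a local bare repository stands in for the GitHub URL, and the variable names workdir and repo_url are illustrative only.

```shell
# A local bare repository stands in for https://github.com/username/mytest.git
# (username is a placeholder) so that this sketch runs offline.
workdir=$(mktemp -d)
git init --quiet --bare "$workdir/remote.git"
repo_url="$workdir/remote.git"

# Clone the (still empty) repository into a new folder named "mytest".
# git may warn that the repository is empty; that is expected here.
git clone --quiet "$repo_url" "$workdir/mytest"

# The clone is already linked to its source under the remote name "origin".
git -C "$workdir/mytest" remote -v
```

With a real repository, repo_url would simply be the URL copied from GitHub.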
The basic workflow when using RStudio to do version control is:
- Add/edit files in the current project.
- Add the appropriate files to the staging area.
- Commit the changes to the local git repository and provide a short description of the changes made.
- Push the changes from the local repository to a remote repository (GitHub).
We provide a simple example.
Change some code
- Open a new R Script (Ctrl + Shift + N on a Windows computer or Cmd + Shift + N on a Mac).
- Add a single command to the first line of the file: print("Hello, world!").
- Save the file as test.R.
Stage the changes
In the Git tab of the Environment pane:
- Check the box next to “test.R”.
- Click the Commit button.
Commit the changes to the local repository
- Review changes.
- Add a Commit message.
- This is a brief description of the changes made in the commit.
- Click Commit.
Push the changes to the remote repository
After you close the Commit window, click Push in the Git tab of the Environment pane to push the changes from the local repository to GitHub.
The changes should be pushed to your GitHub repository, which you can see if you go to that repository on your GitHub page.
- We can see that test.R has been added to our “mytest” repository on GitHub.
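The edit, stage, commit, and push steps performed through the Git tab map onto git commands that can also be run in the Terminal. The following is a rough sketch, assuming a POSIX-style shell. To keep it runnable offline, a local bare repository stands in for GitHub, and the name and email are placeholders set only inside the demo repository.

```shell
# Stand-in for GitHub so the sketch runs offline: a local bare repository.
demo_dir=$(mktemp -d)
git init --quiet --bare "$demo_dir/remote.git"

# Local repository playing the role of the "mytest" project.
git init --quiet "$demo_dir/mytest"
cd "$demo_dir/mytest"
git remote add origin "$demo_dir/remote.git"

# Placeholder identity, stored only in this demo repository.
git config user.name "Example User"
git config user.email "user@example.com"

# 1. Change some code.
echo 'print("Hello, world!")' > test.R

# 2. Stage the change (the checkbox in the Git tab).
git add test.R

# 3. Commit to the local repository with a short message.
git commit --quiet -m "Add test.R"
git branch -M main

# 4. Push the commit to the remote repository.
git push --quiet -u origin main

# List the branches the remote now holds.
git ls-remote --heads origin
```

In a project cloned from GitHub, the remote and branch already exist, so only the add, commit, and push commands are needed.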
Adding RStudio version control to a local RStudio project
Suppose we want to use version control for a local RStudio project that isn’t already linked to a git repository.
A local RStudio project can be created in two ways.
Creating a New Project with a New Directory
We first discuss creating a new RStudio project with a new directory.
- Create a New Project by first clicking File → New Project.
- Select New Project.
- Select New Directory.
- Name the new directory using the Directory name field (“test2” in our case), uncheck the “Create a git repository” box (we will do this manually), and click Create Project.
Creating a New Project from an existing directory
We now discuss creating a new RStudio project from an existing directory.
- Create a New Project by first clicking File → New Project.
- Select Existing Directory.
- Browse to the existing folder we want to turn into a project and click Create Project.
Adding version control
To add GitHub version control to an existing RStudio project, we need a place to store the remote repository.
First, create a new repository on GitHub.com to store the project.
- Usually, the repository name is the same as your project name, but it doesn’t have to be.
- In our example, the project name and GitHub repository name are both “test2”.
Next, open a Terminal window in RStudio and run the following commands:
- Run git init to initialize git for this project.
- You only need to do this if no git repository is already initialized.
- If you checked “Create a git repository” when creating the New Project, then you can skip this step.
- Run git add --all to stage all project files.
- Run git commit -m "initial commit" to make our first commit.
- Run git branch -M main to rename the current branch to “main” (GitHub's default branch name).
- Run git remote add origin https://github.com/username/reponame.git to link the remote repository to the local repository.
- Replace username and reponame with the correct text.
- Run git push -u origin main to push the commit from the local repository to GitHub.
In order, the commands are:
git init
git add --all
git commit -m "initial commit"
git branch -M main
git remote add origin https://github.com/username/reponame.git
git push -u origin main
If you close and reopen the project in RStudio, then you should see the Git tab in the Environment pane.
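Because git init is only needed when the project folder is not already a git repository, it can help to check first. The following is a minimal sketch, assuming a POSIX-style shell; the function name init_if_needed is a hypothetical helper, and a temporary folder stands in for the “test2” project.

```shell
# Initialize a repository only when the current folder is not already inside one.
# init_if_needed is a hypothetical helper name used only for this sketch.
init_if_needed() {
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "already a git repository; skipping git init"
  else
    git init --quiet
    echo "initialized a new git repository"
  fi
}

# Try it in a throwaway folder standing in for the "test2" project.
project_dir=$(mktemp -d)
cd "$project_dir"
init_if_needed   # no repository yet, so this call runs git init
init_if_needed   # a repository now exists, so this call skips git init
```

The check relies on git rev-parse --is-inside-work-tree, which succeeds only when run inside the working tree of a repository.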
Final thoughts
This is a small tutorial for using RStudio for version control with git and GitHub.
If you run into problems, https://happygitwithr.com/ is a more in-depth resource for solving them.